1. INTRODUCTION

Automatic speech recognition is a computing technique used in many systems. In car systems, speech
recognition identifies the driver's voice and responds to commands such as playing music, activating the
global positioning system (GPS), placing phone calls, and selecting radio stations. In language education,
speech recognition can teach proper pronunciation, help people develop their oral expression, and
facilitate education for blind students. In this context, this research proposes a method to identify the
speaker's voice using adaptive orthogonal transformations [1] and compares it with the method of
mel-frequency cepstral coefficients (MFCCs) [2]-[4].

To identify the speaker's voice, several methods are used to extract the distinctive features of each
voice, among them mel-frequency cepstral coefficients. Although numerous researchers have chosen MFCCs as
their feature extraction method because of their advantages [5], [6], the method reaches its limits in
improving automatic speaker recognition systems, as described in [7]-[9]. It needs a large voice training
dataset and a long execution time to identify the voice of each speaker [10], and the same goes for other
approaches such as principal component analysis (PCA), discrete wavelet transform (DWT), and empirical
mode decomposition (EMD), as reported in [11].

Janse et al. [12] presented a comparative study of mel-frequency cepstral coefficients and the
discrete wavelet transform, noting that MFCC values are not very robust in the presence of additive noise
and that DWT requires a longer compression time. Winursito et al. [13] combined MFCCs with data reduction
methods to improve accuracy and increase the computational speed of the classification process by
decreasing the dimensions of the feature data. The data reduction process was designed in two versions:
MFCC+SVD (version 1) and MFCC+PCA (version 2). The results showed a performance improvement for the
proposed approach. Wang and Lawlor [14] proposed a speaker recognition system that combines MFCCs with
back-propagation neural networks. They found that this approach works successfully only when the number
of unfamiliar speakers is not too large.

From these studies, we deduce that many authors have used MFCCs as a feature extraction approach
and, to strengthen their methodologies, have used other approaches in the classification process to obtain
the desired results. In addition, other authors have developed new methods that address the limitations of
MFCCs in order to obtain an improved algorithm that is insensitive to noise and has a fast execution time.
The goal of this study is to solve the problems mentioned above by developing a fast algorithm based on
adaptive orthogonal transformations for extracting the informative features from the voice signal using
the smallest possible training dataset, inspired by references [1], [15]. This paper is organized as
follows: section 2 describes the new approach of orthogonal operators, section 3 discusses the comparison
results obtained between MFCCs and the proposed method, and section 4 concludes the paper.


2. RESEARCH METHOD 
2.1. Pre-processing 

Before applying the proposed approach, the signals must first be pre-processed as shown in
Figure 1. The first step removes silence, and the second detects the beginning and the end of the speech
by using the zero-crossing rate (ZCR) [16]-[19]. The third step makes the signals equal in length by zero
padding [20], [21], because the training dataset can contain signals of different lengths. The final step
compresses their size without losing quality, to avoid slowing the system down, by using the Fourier
transform [22], [23] or correlation [24], [25]. Figure 2 shows the input signal before and after removing
silence with detection of the beginning and the end of speech. Figure 3 shows the speech signal after
applying the Fourier transform method to detect the informative intervals. Figure 4 shows the speech
signal after applying correlation to detect the informative intervals.

Figure 1. The pre-processing part

Figure 2. The input signal before and after removing silence with detection of the beginning and the end of
speech

Figure 3. The speech signal after applying the Fourier transform

Figure 4. The speech signal after applying correlation
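
The first three pre-processing steps can be sketched roughly as follows. This is only an illustration under stated assumptions: the frame length, the energy threshold, and the function names are hypothetical and do not come from the paper, and the silence detector here is a simple frame-energy threshold swapped in for brevity. The zero-crossing rate is computed separately to show the quantity the text refers to.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(np.asarray(frame, dtype=float))
    signs[signs == 0] = 1  # treat exact zeros as positive
    return float(np.mean(signs[1:] != signs[:-1]))

def trim_silence(signal, frame_len=256, energy_ratio=0.01):
    """Keep the span between the first and last 'active' frame.

    A frame counts as active when its energy exceeds energy_ratio times the
    peak frame energy (an assumed, hypothetical threshold).
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    if n_frames == 0:
        return signal
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sum(frames ** 2, axis=1)
    active = np.flatnonzero(energy > energy_ratio * energy.max())
    if active.size == 0:
        return signal[:0]  # nothing but silence
    start = active[0] * frame_len
    end = (active[-1] + 1) * frame_len
    return signal[start:end]

def pad_to_length(signal, target_len):
    """Zero-pad (or truncate) so that every signal has target_len samples."""
    out = np.zeros(target_len, dtype=float)
    n = min(len(signal), target_len)
    out[:n] = signal[:n]
    return out

# Synthetic example: silence, a 200 Hz tone sampled at 22050 Hz, silence.
t = np.arange(2048) / 22050.0
tone = np.sin(2 * np.pi * 200 * t)
raw = np.concatenate([np.zeros(1024), tone, np.zeros(1024)])
speech = trim_silence(raw)
fixed = pad_to_length(speech, 4096)
```

Padding to a power of two is chosen here because the transform in section 2.2 assumes a signal size N that is a power of two.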


2.2. Theoretical background 

Our approach consists in searching for the informative features of the signal by using the operator H,
a matrix operator of the transform (dimension N x N) whose number of rows corresponds to the number of
basis functions. To decompose the vector X, the calculation of the discrete spectrum Y with numerical
methods can be represented by the following matrix equation [1], [15], [26]:


$$Y = \frac{1}{N} H X \tag{1}$$


where
$X = [x_1, x_2, \dots, x_N]^T$ is the initial signal to be transformed (size $N = 2^n$).
$Y = [y_1, y_2, \dots, y_N]^T$ is the vector of the spectral coefficients, calculated by the orthogonal operator H.

The calculation of the spectrum Y by using (1) requires $N^2$ multiplication and addition operations.
The most efficient way to reduce the number of operations is to use sparse matrices [27], in which most
elements are zero, so that the algorithm computes and executes faster. The method of Good [28], which is
used in the construction of fast transformation algorithms, consists in expressing the orthogonal spectral
operator H as a product of sparse matrices $G_i$ composed of matrices of minimal dimension called spectral
kernels, which take the following form:


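$$V_{i,j}(\alpha_{i,j}) = \begin{bmatrix} \cos(\alpha_{i,j}) & \sin(\alpha_{i,j}) \\ \sin(\alpha_{i,j}) & -\cos(\alpha_{i,j}) \end{bmatrix} \tag{2}$$


with $\alpha_{i,j} \in [0, 2\pi]$.
Then $G_i$ will be written as (3):


$$G_i = \begin{bmatrix} V_{i,1} & 0 & \cdots & 0 \\ 0 & V_{i,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & V_{i,N/2} \end{bmatrix} \tag{3}$$


with $i = 1, \dots, n$, where $n = \log_2 N$ is the number of matrices $G_i$; each matrix $G_i$ contains $N/2$
spectral kernels $V_{i,j}(\alpha_{i,j})$ of dimension $2 \times 2$.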
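
As a small illustration of why this factorization is cheap, the sketch below builds one spectral kernel and one block-diagonal stage and checks two properties that follow from the definitions: each stage is orthogonal, and an N x N stage holds at most 2N non-zero entries. The function names and the angle values are arbitrary choices made for the example, not taken from the paper.

```python
import numpy as np

def spectral_kernel(alpha):
    """2 x 2 kernel [[cos a, sin a], [sin a, -cos a]]; symmetric and orthogonal."""
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[c, s], [s, -c]])

def block_diagonal_stage(alphas):
    """Place one kernel per angle along the diagonal of an N x N matrix, N = 2 * len(alphas)."""
    n = 2 * len(alphas)
    g = np.zeros((n, n))
    for j, alpha in enumerate(alphas):
        g[2 * j:2 * j + 2, 2 * j:2 * j + 2] = spectral_kernel(alpha)
    return g

alphas = np.linspace(0.1, 1.0, 4)   # four arbitrary angles, so N = 8
g = block_diagonal_stage(alphas)
nonzero_entries = np.count_nonzero(g)          # 4 entries per kernel
is_orthogonal = np.allclose(g @ g.T, np.eye(8))
```

With $\log_2 N$ such stages, applying them one after another touches on the order of $2N \log_2 N$ entries rather than the $N^2$ entries of a dense operator.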
So, the formula of Y will be:


$$Y = \frac{1}{N} G_n G_{n-1} \cdots G_1 X \tag{4}$$


The algorithm goes through a procedure that adapts the operator H to a class of input signals. It
consists in calculating the average of the statistical features obtained in the pre-processing part
(section 2.1) to form the standard vector $R_{sd}$. We say that the operator $H_a$ is adapted to a class of
signals represented by a standard vector $R_{sd}$ if it verifies the following condition:


$$\frac{1}{N} H_a R_{sd} = Y_t = [y_{t1}, 0, \dots, 0]^T \quad \text{with } y_{t1} \neq 0 \tag{5}$$


where $Y_t$ is the target vector that constructs the adaptation criterion of the operator $H_a$ to $R_{sd}$.
The target vector $Y_t$ is calculated as (6):


$$Y_i = G_i Y_{i-1} \tag{6}$$


with $i = 1, \dots, \log_2 N$ and $Y_0 = R_{sd}$.
In a simplified way, the synthesis procedure of the orthogonal transformation operator is as follows:
For $i = 1$, $Y_1 = G_1 R_{sd}$, where $Y_1$ contains $N/2$ non-zero elements.
For $i = 2$, $Y_2 = G_2 Y_1$, where $Y_2$ contains $N/2^2$ non-zero elements.
For $i = n$, $Y_n = Y_t = G_n Y_{n-1}$, where $Y_n$ contains $N/2^n$ non-zero elements.


Then the calculation of the orthogonal spectral operator is:
$$H_a = G_n G_{n-1} \cdots G_1 \tag{7}$$
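
One way to realize this synthesis procedure is sketched below. Two points are assumptions of the sketch rather than statements from the paper: the kernel angle for a pair (a, b) is chosen as atan2(b, a), which sends the pair to (sqrt(a^2 + b^2), 0), and stage i pairs elements that lie 2^(i-1) positions apart, so each stage is block-diagonal only up to a reordering of the elements. Any scalar normalization of the transform is left out, and the dense matrices are built purely for readability.

```python
import numpy as np

def synthesize_operator(r_sd):
    """Build an orthogonal operator that maps r_sd onto [y, 0, ..., 0].

    Returns (h, y_t): the dense operator and the resulting target vector.
    The pairing scheme and angle choice are assumptions of this sketch.
    """
    y = np.asarray(r_sd, dtype=float).copy()
    n_samples = len(y)
    n_stages = int(np.log2(n_samples))
    if 2 ** n_stages != n_samples:
        raise ValueError("signal length must be a power of two")

    h = np.eye(n_samples)
    for i in range(n_stages):
        stride = 2 ** i
        g = np.zeros((n_samples, n_samples))
        for base in range(0, n_samples, 2 * stride):
            for offset in range(stride):
                p, q = base + offset, base + offset + stride
                # Angle that zeroes the second element of the pair (y[p], y[q]).
                alpha = np.arctan2(y[q], y[p])
                c, s = np.cos(alpha), np.sin(alpha)
                g[np.ix_([p, q], [p, q])] = [[c, s], [s, -c]]
        y = g @ y      # half of the remaining non-zero elements vanish
        h = g @ h      # accumulate the product of the stages
    return h, y

rng = np.random.default_rng(0)
r_sd = rng.normal(size=8)           # stand-in for an averaged feature vector
h_a, y_t = synthesize_operator(r_sd)
```

Because every stage is orthogonal, the single surviving coefficient in this sketch equals the Euclidean norm of the standard vector, and a signal close to that vector keeps most of its energy in the first coefficient.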


Figure 5 shows the overall process to extract the informative features from the input signals by using
our approach.

As shown in Figure 5, the extraction of the informative features consists of 7 steps: 

— Step 1: Input signals go through the pre-processing stage (section 2.1).

— Step 2: We calculate the average of the statistical features obtained during the pre-processing part (using 
Fourier transform and correlation). 

— Step 3: The operator synthesis algorithm is applied to the average of the statistical features obtained from 
the previous step. 

— Step 4: The output of the algorithm is the adaptive operator H. 

— Step 5: The projection multiplication is applied between the operator H and the rest of the statistical 
features. 

— Step 6: The result of the previous operation is a set of informative features that characterize each signal of 
the class. 

— Step 7: The average of the feature vectors is calculated. The result is an informative feature vector with a 
minimum dimension that characterizes the whole class. 

Figure 5. The overall process to extract the informative features from the input signals by using the adaptive
orthogonal transform method


2.3. Datasets 

The speech signals are recorded and filtered with the Audacity program. Each signal is recorded at
22 kHz by default, with a duration between 1 s and 2 s. The training dataset contains 10 classes, each
containing 100 voice recordings of one speaker (Class 1 contains 100 recordings of Speaker A, Class 2
contains 100 recordings of Speaker B, and so on up to Class 10 for Speaker J).

The test dataset contains 6000 voice recordings of speakers A, B, ..., J and other unfamiliar speakers.
To test the similarity between the speaker's voice in the training dataset and the speaker's voice in the
test dataset, dynamic time warping (DTW) [29]-[32] is used. It consists in comparing two voice signals by
considering the Euclidean distance between the two vectors obtained by the applied method, which is
defined by (8):


$$D_i = \sqrt{\sum (a_i - b_i)^2} \tag{8}$$


with

$D_i$: the distance between the i-th vector of the spectrum a and the i-th vector of the spectrum b.

$n$: the dimension of the spectra a and b.

Therefore, the vector $a_i$ will correspond to class i if $D_i = \min(D_{i=1 \dots C})$, where C is the number of classes.
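
The decision rule can be sketched as a nearest-class search. The sketch implements only the Euclidean comparison of (8) followed by the minimum over classes; it performs no temporal alignment, and the function name and the toy feature vectors are hypothetical.

```python
import numpy as np

def classify(feature, class_features):
    """Return the index of the class whose feature vector is closest to `feature`.

    class_features has one row per class; distances are Euclidean.
    """
    class_features = np.asarray(class_features, dtype=float)
    feature = np.asarray(feature, dtype=float)
    distances = np.linalg.norm(class_features - feature, axis=1)
    return int(np.argmin(distances)), distances

# Toy example with three classes and two-dimensional feature vectors.
class_features = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
best_class, distances = classify([0.9, 0.1], class_features)
```

Handling unfamiliar speakers would additionally need a rejection rule, for example a maximum accepted distance; the text does not specify one, so none is shown.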


3. RESULTS AND DISCUSSION 
The quality of recognition is measured by calculating the recognition rate which is defined as (9). 


$$\text{Rate} = \frac{\text{the recognized speakers count}}{\text{the test dataset size}} \times 100 \tag{9}$$


Tables 1 and 2 show the voice recognition rate as a function of the size of the analysis interval. In
both tables, the adaptive orthogonal transform method outperforms the MFCCs approach at every interval
size. As mentioned in section 2.1, we used correlation and the Fourier transform to work only with the
informative intervals of the signal instead of the whole signal. With Fourier transform intervals of size
4096, the voice identification rates differ by 47.5 percentage points between our approach (96.8%) and
MFCCs (49.3%). With correlation intervals of size 4096, the difference is 45.0 percentage points between
our approach (98.1%) and MFCCs (53.1%). At the largest interval sizes (2048 and 4096), correlation
intervals give better results than Fourier transform intervals for both our approach and MFCCs. The
proposed method identified 5886 of the 6000 voice recordings in the test dataset (rate 98.1%), whereas
MFCCs identified only 3186 (rate 53.1%); these results show the efficiency of our algorithm.
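
The reported rates can be checked directly against the recognition rate formula (9), using only the counts quoted in the text; the function name is illustrative.

```python
def recognition_rate(recognized, total):
    """Recognition rate in percent: recognized count over test set size, times 100."""
    return 100.0 * recognized / total

proposed_rate = recognition_rate(5886, 6000)   # adaptive orthogonal transform
mfcc_rate = recognition_rate(3186, 6000)       # MFCCs
gap_in_points = proposed_rate - mfcc_rate      # difference in percentage points

print(round(proposed_rate, 1), round(mfcc_rate, 1), round(gap_in_points, 1))
```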


Table 1. The voice recognition rate according to the size of the interval using Fourier transform with MFCCs
and the adaptive orthogonal transform method


Size of interval          Recognition rate     Recognition rate with the adaptive
(Fourier transform)       with MFCCs (%)       orthogonal transform method (%)
128                       23.8                 68.9
256                       32.8                 75.3
512                       35.6                 82.3
1024                      38.5                 91.5
2048                      45.6                 93.1
4096                      49.3                 96.8


Table 2. The voice recognition rate according to the size of the interval using correlation with MFCCs and
the adaptive orthogonal transform method


Size of interval          Recognition rate     Recognition rate with the adaptive
(correlation)             with MFCCs (%)       orthogonal transform method (%)
128                       25.6                 65.8
256                       30.3                 73.2
512                       37.3                 81.7
1024                      43.2                 90.2
2048                      49.1                 95.6
4096                      53.1                 98.1


4. CONCLUSION 

MFCCs are among the most widely used and best-known methods in the field of signal processing.
However, they need a large training dataset and a long execution time to extract the important features
when the number of unfamiliar speakers in the test dataset is large. For these reasons, we developed a new
method based on the creation of the operator H, which is adaptable to any input signal. Although its
creation goes through $\log_2 N$ iterations, where N is the length of the signal, working with sparse
matrices, in which most elements are zero, makes the calculation and execution time of the algorithm
faster. Our future goal is to increase the number of voice recordings in the test dataset and to decrease
the number of voice recordings in the training dataset to see whether the method continues to give
successful results. In addition, we will combine it with methods that are commonly used for
classification, such as the hidden Markov model (HMM) and artificial neural networks (ANN).